{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text Classification with Noisy Labels\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Consider Using Datalab\n",
    "<br/>\n",
    "\n",
    "If you are interested in detecting a wide variety of issues in your text dataset, check out the [Datalab text tutorial](https://docs.cleanlab.ai/stable/tutorials/datalab/text.html). Datalab can detect many other types of data issues beyond label issues, whereas CleanLearning is a convenience method to handle noisy labels with sklearn-compatible classification models.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this 5-minute quickstart tutorial, we use cleanlab to find potential label errors in an intent classification dataset composed of (text) customer service requests at an online bank. We consider a subset of the [Banking77-OOS Dataset](https://arxiv.org/abs/2106.04564) containing 1,000 customer service requests which can be classified into 10 categories corresponding to the intent of the request. cleanlab will shortlist examples that confuse our ML model the most; many of which are potential label errors, out-of-scope examples, or otherwise ambiguous examples. cleanlab's `CleanLearning` class automatically detects and filters out such badly labeled data, in order to train a more robust version of any Machine Learning model. No change to your existing modeling code is required!\n",
    "\n",
    "\n",
    "**Overview of what we'll do in this tutorial:**\n",
    "\n",
    "- Define a ML model that can be trained on our dataset (here we use Logistic Regression applied to text embeddings from a pretrained Transformer network, you can use any text classifier model).\n",
    "\n",
    "- Use `CleanLearning` to wrap this ML model and compute out-of-sample predicted class probabilites, which allow us to identify potential label errors in the dataset.\n",
    "\n",
    "- Train a more robust version of the same ML model after dropping the detected label errors using `CleanLearning`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Quickstart\n",
    "<br/>\n",
    "    \n",
    "Already have an sklearn compatible `model`, `data` and given `labels`? Run the code below to train your `model` and get label issues using `CleanLearning`. \n",
    "    \n",
    "You can subsequently use the same `CleanLearning` object to train a more robust model (only trained on the clean data) by calling the `.fit()` method and passing in the `label_issues` found earlier.\n",
    "\n",
    "\n",
    "<div  class=markdown markdown=\"1\" style=\"background:white;margin:16px\">  \n",
    "    \n",
    "```python\n",
    "\n",
    "from cleanlab.classification import CleanLearning\n",
    "\n",
    "cl = CleanLearning(model)\n",
    "label_issues = cl.find_label_issues(train_data, labels)  # identify mislabeled examples \n",
    "  \n",
    "cl.fit(train_data, labels, label_issues=label_issues)\n",
    "preds = cl.predict(test_data)  # predictions from a version of your model \n",
    "                               # trained on auto-cleaned data\n",
    "\n",
    "\n",
    "```\n",
    "    \n",
    "</div>\n",
    "    \n",
    "Is your model/data not compatible with `CleanLearning`? You can instead run cross-validation on your model to get out-of-sample `pred_probs`. Then run the code below to get label issue indices ranked by their inferred severity.\n",
    "\n",
    "\n",
    "<div  class=markdown markdown=\"1\" style=\"background:white;margin:16px\">  \n",
    "    \n",
    "```python\n",
    "\n",
    "from cleanlab.filter import find_label_issues\n",
    "\n",
    "ranked_label_issues = find_label_issues(\n",
    "    labels,\n",
    "    pred_probs,\n",
    "    return_indices_ranked_by=\"self_confidence\",\n",
    ")\n",
    "    \n",
    "\n",
    "```\n",
    "    \n",
    "</div>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Install required dependencies\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use `pip` to install all packages required for this tutorial as follows:\n",
    "\n",
    "```ipython3\n",
    "!pip install sentence-transformers\n",
    "!pip install cleanlab\n",
    "# Make sure to install the version corresponding to this tutorial\n",
    "# E.g. if viewing master branch documentation:\n",
    "#     !pip install git+https://github.com/cleanlab/cleanlab.git\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "# Package installation (hidden on docs.cleanlab.ai).\n",
    "# If running on Colab, may want to use GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)\n",
    "# Package versions we used:scikit-learn==1.2.0 sentence-transformers==2.2.2\n",
    "\n",
    "dependencies = [\"cleanlab\", \"sentence_transformers\"]\n",
    "\n",
    "# Supress outputs that may appear if tensorflow happens to be improperly installed: \n",
    "import os \n",
    "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"  # disable parallelism to avoid deadlocks with huggingface\n",
    "\n",
    "if \"google.colab\" in str(get_ipython()):  # Check if it's running in Google Colab\n",
    "    %pip install cleanlab  # for colab\n",
    "    cmd = ' '.join([dep for dep in dependencies if dep != \"cleanlab\"])\n",
    "    %pip install $cmd\n",
    "else:\n",
    "    dependencies_test = [dependency.split('>')[0] if '>' in dependency \n",
    "                         else dependency.split('<')[0] if '<' in dependency \n",
    "                         else dependency.split('=')[0] for dependency in dependencies]\n",
    "    missing_dependencies = []\n",
    "    for dependency in dependencies_test:\n",
    "        try:\n",
    "            __import__(dependency)\n",
    "        except ImportError:\n",
    "            missing_dependencies.append(dependency)\n",
    "\n",
    "    if len(missing_dependencies) > 0:\n",
    "        print(\"Missing required dependencies:\")\n",
    "        print(*missing_dependencies, sep=\", \")\n",
    "        print(\"\\nPlease install them before running the rest of this notebook.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re \n",
    "import string \n",
    "import pandas as pd \n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.model_selection import train_test_split, cross_val_predict \n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sentence_transformers import SentenceTransformer\n",
    "\n",
    "from cleanlab.classification import CleanLearning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "# This cell is hidden from docs.cleanlab.ai \n",
    "\n",
    "import random \n",
    "import numpy as np \n",
    "\n",
    "pd.set_option(\"display.max_colwidth\", None) \n",
    "\n",
    "SEED = 123456 # for reproducibility \n",
    "\n",
    "np.random.seed(SEED)\n",
    "random.seed(SEED)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Load and format the text dataset\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv(\"https://s.cleanlab.ai/banking-intent-classification.csv\")\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_texts, raw_labels = data[\"text\"].values, data[\"label\"].values\n",
    "\n",
    "raw_train_texts, raw_test_texts, raw_train_labels, raw_test_labels = train_test_split(raw_texts, raw_labels, test_size=0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_classes = len(set(raw_train_labels))\n",
    "\n",
    "print(f\"This dataset has {num_classes} classes.\")\n",
    "print(f\"Classes: {set(raw_train_labels)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's print the first example in the train set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "i = 0\n",
    "print(f\"Example Label: {raw_train_labels[i]}\")\n",
    "print(f\"Example Text: {raw_train_texts[i]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data is stored as two numpy arrays for each the train and test set:\n",
    "\n",
    "1. `raw_train_texts` and `raw_test_texts` store the customer service requests utterances in text format\n",
    "2. `raw_train_labels` and `raw_test_labels` store the intent categories (labels) for each example\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we need to perform label enconding on the labels, cleanlab's functions require the labels for each example to be an interger integer in 0, 1, …, num_classes - 1. We will use sklearn's `LabelEncoder` to encode our labels.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "encoder = LabelEncoder()\n",
    "encoder.fit(raw_train_labels)\n",
    "\n",
    "train_labels = encoder.transform(raw_train_labels)\n",
    "test_labels = encoder.transform(raw_test_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Bringing Your Own Data (BYOD)?\n",
    "\n",
    "You can easily replace the above with your own text dataset, and continue with the rest of the tutorial.\n",
    "\n",
    "Your classes (and entries of `train_labels` / `test_labels`) should be represented as integer indices 0, 1, ..., num_classes - 1.\n",
    "For example, if your dataset has 7 examples from 3 classes, `train_labels` might be: `np.array([2,0,0,1,2,0,1])`\n",
    "\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we convert the text strings into vectors better suited as inputs for our ML model. \n",
    "\n",
    "We will use numeric representations from a pretrained Transformer model as embeddings of our text. The [Sentence Transformers](https://huggingface.co/docs/hub/sentence-transformers) library offers simple methods to compute these embeddings for text data. Here, we load the pretrained `electra-small-discriminator` model, and then run our data through network to extract a vector embedding of each example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "transformer = SentenceTransformer('google/electra-small-discriminator')\n",
    "\n",
    "train_texts = transformer.encode(raw_train_texts)\n",
    "test_texts = transformer.encode(raw_test_texts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our subsequent ML model will directly operate on elements of `train_texts` and `test_texts` in order to classify the customer service requests."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Define a classification model and use cleanlab to find potential label errors\n",
    "\n",
    "<a id=\"section3\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A typical way to leverage pretrained networks for a particular classification task is to add a linear output layer and fine-tune the network parameters on the new data. However this can be computationally intensive. Alternatively, we can freeze the pretrained weights of the network and only train the output layer without having to rely on GPU(s). Here we do this conveniently by fitting a scikit-learn linear model on top of the extracted embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = LogisticRegression(max_iter=400)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can define the `CleanLearning` object with our Logistic Regression model and use `find_label_issues` to identify potential label errors.\n",
    "\n",
    "`CleanLearning` provides a wrapper class that can easily be applied to any scikit-learn compatible model, which can be used to find potential label issues and train a more robust model if the original data contains noisy labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv_n_folds = 5  # for efficiency; values like 5 or 10 will generally work better\n",
    "\n",
    "cl = CleanLearning(model, cv_n_folds=cv_n_folds)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "label_issues = cl.find_label_issues(X=train_texts, labels=train_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `find_label_issues` method above will perform cross validation to compute out-of-sample predicted probabilites for each example, which is used to identify label issues.\n",
    "\n",
    "This method returns a dataframe containing a label quality score for each example. These numeric scores lie between 0 and 1, where  lower scores indicate examples more likely to be mislabeled. The dataframe also contains a boolean column specifying whether or not each example is identified to have a label issue (indicating it is likely mislabeled). Note that the given and predicted labels here are encoded as intergers as that was the format expected by `cleanlab`, we will inverse transform them later in this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "label_issues.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can get the subset of examples flagged with label issues, and also sort by label quality score to find the indices of the 10 most likely mislabeled examples in our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "identified_issues = label_issues[label_issues[\"is_label_issue\"] == True]\n",
    "lowest_quality_labels = label_issues[\"label_quality\"].argsort()[:10].to_numpy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    f\"cleanlab found {len(identified_issues)} potential label errors in the dataset.\\n\"\n",
    "    f\"Here are indices of the top 10 most likely errors: \\n {lowest_quality_labels}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's review some of the most likely label errors. To help us inspect these datapoints, we define a method to print any example from the dataset, together with its given (original) label and the suggested alternative label from cleanlab.\n",
    "\n",
    "We then display some of the top-ranked label issues identified by cleanlab:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def print_as_df(index):\n",
    "    return pd.DataFrame(\n",
    "        {\n",
    "            \"text\": raw_train_texts, \n",
    "            \"given_label\": raw_train_labels,\n",
    "            \"predicted_label\": encoder.inverse_transform(label_issues[\"predicted_label\"]),\n",
    "        },\n",
    "    ).iloc[index]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print_as_df(lowest_quality_labels[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These are very clear label errors that cleanlab has identified in this data! Note that the `given_label` does not correctly reflect the intent of these requests, whoever produced this dataset made many mistakes that are important to address before modeling the data.\n",
    "\n",
    "cleanlab has shortlisted the most likely label errors to speed up your data cleaning process. With this list, you can decide whether to fix these label issues or remove ambiguous examples from the dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Train a more robust model from noisy labels\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fixing the label issues manually may be time-consuming, but cleanlab can filter these noisy examples and train a model on the remaining clean data for you automatically.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To establish a baseline, let's first train and evaluate our original Logistic Regression model.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "baseline_model = LogisticRegression(max_iter=400)  # note we first re-instantiate the model\n",
    "baseline_model.fit(X=train_texts, y=train_labels)\n",
    "\n",
    "preds = baseline_model.predict(test_texts)\n",
    "acc_og = accuracy_score(test_labels, preds)\n",
    "print(f\"\\n Test accuracy of original model: {acc_og}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have a baseline, let's check if using `CleanLearning` improves our test accuracy.\n",
    "\n",
    "`CleanLearning` provides a wrapper that can be applied to any scikit-learn compatible model. The resulting model object can be used in the same manner, but it will now train more robustly if the data has noisy labels.\n",
    "\n",
    "We can use the same `CleanLearning` object defined above, and  pass the label issues we already computed into `.fit()` via the `label_issues` argument. This accelerates things; if we did not provide the label issues, then they would be recomputed via cross-validation. After that `CleanLearning` simply deletes the examples with label issues and retrains your model on the remaining data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "cl.fit(X=train_texts, labels=train_labels, label_issues=cl.get_label_issues())\n",
    "\n",
    "pred_labels = cl.predict(test_texts)\n",
    "acc_cl = accuracy_score(test_labels, pred_labels)\n",
    "print(f\"Test accuracy of cleanlab's model: {acc_cl}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the test set accuracy slightly improved as a result of the data cleaning. Note that this will not always be the case, especially when we are evaluating on test data that are themselves noisy. The best practice is to run cleanlab to identify potential label issues and then manually review them, before blindly trusting any accuracy metrics. In particular, the most effort should be made to ensure high-quality test data, which is supposed to reflect the expected performance of our model during deployment.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Spending too much time on data quality?\n",
    "\n",
    "Using this open-source package effectively can require significant ML expertise and experimentation, plus handling detected data issues can be cumbersome.\n",
    "\n",
    "That’s why we built [Cleanlab Studio](https://cleanlab.ai/blog/data-centric-ai/) -- an automated platform to find **and fix** issues in your dataset, 100x faster and more accurately.  Cleanlab Studio automatically runs optimized data quality algorithms from this package on top of cutting-edge AutoML & Foundation models fit to your data, and helps you fix detected issues via a smart data correction interface. [Try it](https://cleanlab.ai/) for free!\n",
    "\n",
    "<p align=\"center\">\n",
    "  <img src=\"https://raw.githubusercontent.com/cleanlab/assets/master/cleanlab/ml-with-cleanlab-studio.png\" alt=\"The modern AI pipeline automated with Cleanlab Studio\">\n",
    "</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "# Note: This cell is only for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.\n",
    "\n",
    "highlighted_indices = [646, 390, 628, 702]  # check these examples were found in find_label_issues\n",
    "if not all(x in identified_issues.index for x in highlighted_indices):\n",
    "    raise Exception(\"Some highlighted examples are missing from ranked_label_issues.\")\n",
    "\n",
    "# Also check that cleanlab has improved prediction accuracy\n",
    "if acc_og >= acc_cl:\n",
    "    raise Exception(\"Cleanlab training failed to improve model accuracy.\")"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "Text x TensorFlow",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
